Companies often have to deal with attrition, that is, employees' choice to voluntarily leave the workplace. It is easy to see why this phenomenon can be a problem for any company: employee turnover is expensive, since recruitment, hiring and training costs need to be covered, and it increases the risk of disrupting workplace stability and productivity. Hence the need for a strategy to foresee and, if possible, prevent attrition, and to mitigate its consequences.
This premise motivates our decision to analyze the “Employee Attrition and Factors” dataset, which concerns the reasons that may lead employees to leave the company they work for. We are dealing with a classification problem: our objective is to predict whether someone ultimately leaves their job or not, and the motivations behind that choice.
Let us start our analysis by importing the selected dataset.
HR_Analytics <- read.csv("~/Downloads/HR_Analytics.csv")
attach(HR_Analytics)
HR_Analytics <- HR_Analytics[, -c(4, 13)]
The chosen dataset is synthetically generated by IBM. It consists of 1470 instances, each corresponding to an employee of the company and described by 33 features, both numerical and categorical, representing the factors that may cause attrition. The column Attrition is a binary variable that indicates whether an employee has quit.
| Name of the variable | Type | Description |
|---|---|---|
| Age | Discrete | The age of the employee |
| Attrition | Nominal | Whether or not the employee has left the organization |
| BusinessTravel | Ordinal | The frequency of business travel for the employee |
| Department | Nominal | The department the employee works in |
| DistanceFromHome | Discrete | The distance from home in miles for the employee |
| Education | Ordinal | The level of education achieved by the employee |
| EducationField | Nominal | The field of study for the employee’s education |
| EmployeeCount | Discrete | The total number of employees in the organization |
| EmployeeNumber | Discrete | A unique identifier for each employee profile |
| EnvironmentSatisfaction | Ordinal | The employee’s satisfaction with their work environment |
| Gender | Nominal | The gender of the employee |
| JobInvolvement | Ordinal | The level of involvement required for the employee’s job |
| JobLevel | Ordinal | The job level of the employee |
| JobRole | Nominal | The role of the employee in the organization |
| JobSatisfaction | Ordinal | The employee’s satisfaction with their job |
| MaritalStatus | Nominal | The marital status of the employee |
| MonthlyIncome | Continuous | The monthly income of the employee |
| MonthlyRate | Continuous | The monthly rate of pay for the employee |
| NumCompaniesWorked | Discrete | The number of companies the employee has worked for |
| Over18 | Nominal | Whether or not the employee is over 18 |
| OverTime | Nominal | Whether or not the employee works overtime |
| PercentSalaryHike | Discrete | The percentage of salary hike for the employee |
| PerformanceRating | Ordinal | The performance rating of the employee |
| RelationshipSatisfaction | Ordinal | The employee’s satisfaction with their relationships |
| StandardHours | Discrete | The standard hours of work for the employees |
| StockOptionLevel | Ordinal | The stock option level of the employee |
| TotalWorkingYears | Discrete | The total number of years the employee has worked |
| TrainingTimesLastYear | Discrete | The number of times the employee received training in the last year |
| WorkLifeBalance | Ordinal | The employee’s perception of their work-life balance |
| YearsAtCompany | Discrete | The number of years the employee has been with the company |
| YearsInCurrentRole | Discrete | The number of years the employee has been in their current role |
| YearsSinceLastPromotion | Discrete | The number of years since the employee’s last promotion |
| YearsWithCurrManager | Discrete | The number of years the employee has been with their current manager |
Our data analysis begins with a check for missing values and duplicated rows:
sum(is.na(HR_Analytics))
## [1] 0
sum(duplicated(HR_Analytics))
## [1] 0
Taking a closer look at the features, we observed:

- the presence of the column EmployeeNumber, which does not make any substantial contribution to the analysis, as it merely serves as an identifier for the employees;
- the presence of the columns Over18, EmployeeCount and StandardHours, which take only a single value each, making them uninformative. For example:
unique(Over18)
## [1] "Y"
Moreover, some records are internally inconsistent: for instance, employee 6 reports NumCompaniesWorked equal to 0 despite years of work experience.

HR_Analytics[6, c("NumCompaniesWorked", "TotalWorkingYears", "YearsAtCompany")]
## NumCompaniesWorked TotalWorkingYears YearsAtCompany
## 6 0 8 7
For these reasons, we decided to remove the columns EmployeeNumber, Over18, EmployeeCount and StandardHours from our dataset. Additionally, we removed all instances where NumCompaniesWorked was equal to 0.
HR_Analytics <- HR_Analytics[NumCompaniesWorked!= 0,]
HR_Analytics <- HR_Analytics[, -c(8, 9, 18, 25, 20)]
We now want to figure out how the features of the HR_Analytics dataset are distributed. First, let us look at its summary.
summary(HR_Analytics)
## Age Attrition BusinessTravel Department
## Min. :18.00 Length:1273 Length:1273 Length:1273
## 1st Qu.:30.00 Class :character Class :character Class :character
## Median :36.00 Mode :character Mode :character Mode :character
## Mean :37.34
## 3rd Qu.:44.00
## Max. :60.00
## DistanceFromHome Education EducationField EnvironmentSatisfaction
## Min. : 1.000 Min. :1.000 Length:1273 Min. :1.00
## 1st Qu.: 2.000 1st Qu.:2.000 Class :character 1st Qu.:2.00
## Median : 7.000 Median :3.000 Mode :character Median :3.00
## Mean : 9.297 Mean :2.933 Mean :2.72
## 3rd Qu.:14.000 3rd Qu.:4.000 3rd Qu.:4.00
## Max. :29.000 Max. :5.000 Max. :4.00
## Gender JobInvolvement JobLevel JobRole
## Length:1273 Min. :1.000 Min. :1.00 Length:1273
## Class :character 1st Qu.:2.000 1st Qu.:1.00 Class :character
## Mode :character Median :3.000 Median :2.00 Mode :character
## Mean :2.727 Mean :2.09
## 3rd Qu.:3.000 3rd Qu.:3.00
## Max. :4.000 Max. :5.00
## JobSatisfaction MaritalStatus MonthlyIncome NumCompaniesWorked
## Min. :1.000 Length:1273 Min. : 1009 Min. :1.00
## 1st Qu.:2.000 Class :character 1st Qu.: 2911 1st Qu.:1.00
## Median :3.000 Mode :character Median : 5067 Median :2.00
## Mean :2.723 Mean : 6611 Mean :3.11
## 3rd Qu.:4.000 3rd Qu.: 8621 3rd Qu.:4.00
## Max. :4.000 Max. :19973 Max. :9.00
## OverTime PercentSalaryHike PerformanceRating
## Length:1273 Min. :11.00 Min. :3.000
## Class :character 1st Qu.:12.00 1st Qu.:3.000
## Mode :character Median :14.00 Median :3.000
## Mean :15.19 Mean :3.156
## 3rd Qu.:18.00 3rd Qu.:3.000
## Max. :25.00 Max. :4.000
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :1.0000 Median :10.00
## Mean :2.727 Mean :0.8013 Mean :11.59
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:16.00
## Max. :4.000 Max. :3.0000 Max. :40.00
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.787 Mean :2.768 Mean : 6.815 Mean : 4.147
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:10.000 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 1.000 Median : 3.000
## Mean : 2.166 Mean : 4.019
## 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :15.000 Max. :17.000
table(HR_Analytics$Attrition)
##
## No Yes
## 1059 214
The distribution of the target variable Attrition is unbalanced, with approximately 84% of employees staying at the company and the remaining 16% leaving.
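These percentages can be recomputed directly from the counts printed above; a quick sketch:

```r
# Class balance of Attrition, recomputed from the counts printed above
counts <- c(No = 1059, Yes = 214)
round(prop.table(counts), 3)  # No: 0.832, Yes: 0.168
```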
We examine the distributions of:

- nominal categorical variables;
- ordinal variables;
- numerical (continuous and discrete) variables.
We observe that the outliers in MonthlyIncome are entirely justifiable, as they correspond to higher-level positions (such as Research Director and Manager) with significant responsibilities, resulting in higher earnings.
We want to explore the distribution of all the features with respect to Attrition, to obtain a general overview and formulate meaningful questions that help identify potential patterns.
To better understand the relationships among the numerical variables in the dataset, we visualize them through a correlation matrix.
Here is the Cramér’s V correlation of each categorical (nominal and ordinal) variable with the target variable Attrition.
## CRAMER'S V
## [1] "Attrition-BusinessTravel 0.1184"
## [1] "Attrition-Department 0.09307"
## [1] "Attrition-EducationField 0.1029"
## [1] "Attrition-Gender 0.04597"
## [1] "Attrition-MaritalStatus 0.1802"
## [1] "Attrition-OverTime 0.2475"
## [1] "Attrition-JobRole 0.2623"
## [1] "Attrition-Education 0.06153"
## [1] "Attrition-JobLevel 0.2326"
## [1] "Attrition-StockOptionLevel 0.2134"
## [1] "Attrition-PerformanceRating 0.008406"
## [1] "Attrition-JobInvolvement 0.145"
## [1] "Attrition-JobSatisfaction 0.09588"
## [1] "Attrition-EnvironmentSatisfaction 0.1504"
## [1] "Attrition-RelationshipSatisfaction 0.05398"
The variables most strongly correlated with Attrition are OverTime, JobRole, JobLevel and StockOptionLevel.
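The values above can be obtained with a small helper of the following kind (a sketch; the function name cramers_v is ours). Cramér’s V is defined as sqrt(χ² / (n · (min(r, c) − 1))), where r and c are the dimensions of the contingency table.

```r
# Cramér's V between two categorical vectors (hypothetical helper)
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Sanity check on a toy, perfectly associated pair: V should equal 1
x <- rep(c("a", "b"), each = 10)
cramers_v(x, x)  # 1
```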
From the box plots, it’s evident that employees who leave the company tend to be younger than their counterparts who choose to stay. This aligns with recent studies indicating that the ‘Z’ and ‘Millennial’ generations are more inclined to change jobs frequently compared to previous generations. Additionally, we’ve noticed that among those who leave, female employees are generally younger than their male colleagues. This led us to wonder whether there’s an imbalanced gender distribution between those who leave and those who stay. However, based on the following plot, it’s clear that gender is not a defining factor in shaping the typical profile of those who decide to leave.
We’re interested in examining how marital status is distributed among employees in the context of attrition. It’s worth highlighting that a substantial 50% of those who choose to leave the company are single employees.
Let’s create a visualization that considers both age and other factors to gain a deeper insight into the typical profile of employees who decide to leave.
Single employees, particularly those in the younger age groups (under 30), demonstrate a greater likelihood of leaving the company when compared to individuals with different marital statuses.
Now, we want to visualize the distribution of BusinessTravel among the two Attrition classes.
<img src="EmployeeAttritionHTML_files/figure-html/unnamed-chunk-40-1.png" width="672" />
It appears that there is a higher presence of employees who travel frequently in the Attrition=Yes class compared to the opposite class (30% vs 15%).
We can observe that, among the groups represented, the highest percentage of attrition is among single employees who travel frequently, followed by single employees who travel rarely. Divorced employees who travel frequently also show a notable percentage of attrition.
In the variables Job Satisfaction, Environment Satisfaction, Relationship Satisfaction, Work-Life Balance, and Job Involvement, we find indicators of employee well-being, contentment, and comfort within the company. Given this, and assuming that these variables may play a role in the decision to leave, we aim to investigate whether there is a significant relationship between each of these variables and the “Attrition” variable.
Before exploring this analysis, we want to provide an initial overview of the overall employee satisfaction among those who choose to stay compared to those who decide to leave. To do this, we calculate the averages of the HR_Analytics Satisfaction variables (Job, Environment, Relationship) and examine their distributions.
Both boxplots share an identical median; however, there is a notable contrast in the spread of the data, evident from the second and third quartiles. This indicates that individuals who leave tend to have an overall lower satisfaction level.
To evaluate the relationships and dependencies among the mentioned features (Job Satisfaction, Environment Satisfaction, Relationship Satisfaction, Work-Life Balance, and Job Involvement), we conducted a statistical analysis using the chi-square test. The chi-square test is a robust tool for examining associations between categorical variables and helps us determine whether there is a significant statistical relationship between these variables and Attrition. Here, we present the p-value results.
```
## [1] "Attrition-JobSatisfaction p-value < 0.00847275109964339"
```
```
## [1] "Attrition-EnvironmentSatisfaction p-value < 0.00000248436040568772"
```
```
## [1] "Attrition-RelationshipSatisfaction p-value < 0.294647700977509"
```
```
## [1] "Attrition-WorkLifeBalance p-value < 0.000385782940897055"
```
```
## [1] "Attrition-JobInvolvement p-value < 0.00000656457066653085"
```
So, for all the considered variables except RelationshipSatisfaction, the p-value is very small (well below 0.05), giving us evidence to reject the null hypothesis of independence. We can say that there is a statistically significant association between Attrition and Job Satisfaction, Environment Satisfaction, Work-Life Balance, and Job Involvement.
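The p-values above were presumably produced by calls of the form chisq.test(table(Attrition, variable))$p.value; here is a self-contained sketch of such a loop, on synthetic data (the data frame df and its columns are illustrative stand-ins for the real features):

```r
# Chi-square test of independence between the target and each ordinal feature
set.seed(1)
df <- data.frame(Attrition       = sample(c("No", "Yes"), 300, replace = TRUE),
                 JobSatisfaction = sample(1:4, 300, replace = TRUE),
                 WorkLifeBalance = sample(1:4, 300, replace = TRUE))
for (v in c("JobSatisfaction", "WorkLifeBalance")) {
  p <- chisq.test(table(df$Attrition, df[[v]]))$p.value
  print(sprintf("Attrition-%s p-value %.4f", v, p))
}
```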
In light of these results, we now want to examine how the overall view of satisfaction changes when considering the average solely between JobSatisfaction and EnvironmentSatisfaction, expecting an even clearer and more distinct interpretation than before.
<img src="EmployeeAttritionHTML_files/figure-html/unnamed-chunk-44-1.png" width="672" />
As we’ve observed from the overall view of Attrition compared to each variable, it’s evident that employees who work overtime exhibit a significantly higher Attrition rate, which stands at 28%. Furthermore, we can see that more than 50% of employees who leave are those who work overtime.
Now, we want to conduct a chi-square test to analyze the relationship between these two variables.
```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(HR_Analytics$Attrition, HR_Analytics$OverTime)
## X-squared = 76.501, df = 1, p-value < 0.00000000000000022
```
Due to the very small p-value, we can reject the null hypothesis and statistically confirm that *OverTime* and *Attrition* are strongly associated.
Considering the previous plots that depict the distribution of Monthly Income in relation to Attrition, it becomes apparent that individuals who decide to leave typically earn less.
We intend to confirm this by conducting a Wilcoxon rank sum test to compare the distribution of Monthly Income (a numerical variable) between the two distinct groups defined by Attrition.
##
## Wilcoxon rank sum test with continuity correction
##
## data: MonthlyIncome by Attrition
## W = 149608, p-value = 0.0000000000001368
## alternative hypothesis: true location shift is not equal to 0
Due to the small p-value obtained, we can reject the null hypothesis: there is strong evidence of a significant difference in monthly income between employees who left and those who stayed.
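The output above presumably comes from a call of the form wilcox.test(MonthlyIncome ~ Attrition); here is a self-contained illustration on synthetic incomes (the gap between the two groups is built in, so the p-value is tiny by construction):

```r
# Wilcoxon rank sum test comparing a numeric variable across two groups
set.seed(42)
income <- c(rnorm(60, mean = 5000, sd = 1000),   # "leavers": lower incomes
            rnorm(60, mean = 6500, sd = 1000))   # "stayers": higher incomes
group  <- rep(c("Yes", "No"), each = 60)
wilcox.test(income ~ factor(group))
```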
Having examined potential factors that could influence the decision to leave, our next step is to explore these factors across the different Job Roles to identify areas where intervention may be necessary.
To begin, we’ll illustrate how Attrition is distributed among the various job roles.
It seems that the most critical roles are Sales Representative (with an Attrition rate of approximately 40%), Human Resources and Laboratory Technician.
Now, we want to visualize whether some of these roles are more significantly affected by the “younger employee attrition” phenomenon.
We observe that younger employees who choose to leave are predominantly found among the roles of Sales Representative, Human Resources, Laboratory Technician.
We can notice that the most critical salary-related issues manifest in the following roles: Sales Representative, Research Scientist, Laboratory Technician, and Human Resources. In contrast, an opposite trend is evident in the roles of Research Director and Healthcare Representative.
We want to determine if there are roles in which the level of satisfaction, as measured by the mean of Job Satisfaction and Environment Satisfaction, is lower compared to others.
With the exception of the Research Director, it appears that those who leave the company tend to have lower overall satisfaction. Notably, this trend is particularly pronounced in roles such as Human Resources, Healthcare Representative, and Sales Representative.
The following plot illustrates that employees who work overtime and decide to leave are predominantly found among Sales Representatives, Laboratory Technicians, and Human Resources.
Our intent is to predict whether or not an employee leaves the workplace. Therefore, we decided to use the following classification models:
Notice that it is necessary to encode categorical variables so that they can be treated as numeric and hence be used in the models: we did so by converting the levels of such columns to numeric codes.
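A minimal sketch of such an encoding (encode_numeric is our own illustrative helper; note that factor() orders levels alphabetically, so truly ordinal columns may need an explicit level ordering):

```r
# Convert every character/factor column to its integer level codes
encode_numeric <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col) || is.factor(col)) as.numeric(factor(col)) else col
  })
  df
}

toy <- data.frame(Attrition = c("Yes", "No", "No"), Age = c(25, 40, 31))
encode_numeric(toy)$Attrition  # 2 1 1 (alphabetical levels: No = 1, Yes = 2)
```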
Let us define a function that allows us to rapidly and effectively compute metrics such as accuracy, precision, recall, specificity and F1 score from the corresponding confusion matrix.
compute_metrics <- function(confusion_matrix) {
  # Rows of the matrix are predicted classes, columns are actual classes
  tp <- confusion_matrix[2, 2]  # predicted 1, actual 1
  tn <- confusion_matrix[1, 1]  # predicted 0, actual 0
  fp <- confusion_matrix[2, 1]  # predicted 1, actual 0
  fn <- confusion_matrix[1, 2]  # predicted 0, actual 1
  accuracy <- (tp + tn) / sum(confusion_matrix)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  f1_score <- 2 * (precision * recall) / (precision + recall)
  metrics <- data.frame(metric = c("Accuracy", "Precision", "Recall", "Specificity", "F1 score"),
                        values = c(accuracy, precision, recall, specificity, f1_score))
  print(t(metrics))
}
We reckon that our priority is to detect people who intend to leave the workplace; therefore, the metric we aim to maximize is recall. In fact, misclassifying an employee who doesn’t actually plan on leaving worries us less than missing someone who intends to quit. Also, as we are dealing with an unbalanced dataset, we are going to prioritize the F1 score over accuracy, as the former, by definition, factors in both recall and precision, thus taking into consideration the prediction performance on both classes.
Lastly, we split our dataset into a training set and a test set with a 75/25 ratio, setting a seed for reproducibility purposes.
set.seed(108)
n <- nrow(HR_Analytics)
train <- sample(1:n, round(n*0.75))
test <- setdiff(1:n, train)
train.data <- HR_Analytics[train, ]
test.data <- HR_Analytics[test, ]
table(test.data$Attrition)
##
## 0 1
## 269 49
The first model used is Logistic Regression.
full.model <- glm(train.data$Attrition ~ ., data = train.data, family = binomial)
summary(full.model)
##
## Call:
## glm(formula = train.data$Attrition ~ ., family = binomial, data = train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.639564717 1.479661601 3.811 0.000138
## Age -0.026071195 0.016149084 -1.614 0.106439
## BusinessTravel 0.977684544 0.213482294 4.580 0.00000466
## Department -0.804382863 0.225850155 -3.562 0.000369
## DistanceFromHome 0.038723594 0.013335771 2.904 0.003687
## Education -0.114273331 0.106080303 -1.077 0.281376
## EducationField 0.146811797 0.076331278 1.923 0.054436
## EnvironmentSatisfaction -0.436455313 0.101031694 -4.320 0.00001560
## Gender -0.583007162 0.228740128 -2.549 0.010810
## JobInvolvement -0.521785893 0.147591426 -3.535 0.000407
## JobLevel -0.581758995 0.346956885 -1.677 0.093592
## JobRole 0.078327691 0.053558574 1.462 0.143613
## JobSatisfaction -0.368926405 0.101850617 -3.622 0.000292
## MaritalStatus -0.550908340 0.199859765 -2.756 0.005843
## MonthlyIncome -0.000001702 0.000080807 -0.021 0.983191
## NumCompaniesWorked 0.189292061 0.048271537 3.921 0.00008804
## OverTime 2.094202844 0.233260253 8.978 < 0.0000000000000002
## PercentSalaryHike -0.064494040 0.047736316 -1.351 0.176680
## PerformanceRating 0.209080116 0.489693979 0.427 0.669408
## RelationshipSatisfaction -0.175882249 0.102533741 -1.715 0.086279
## StockOptionLevel -0.214702940 0.174292000 -1.232 0.218002
## TotalWorkingYears -0.066036861 0.033291112 -1.984 0.047298
## TrainingTimesLastYear -0.158876646 0.091248738 -1.741 0.081659
## WorkLifeBalance -0.541483996 0.148526190 -3.646 0.000267
## YearsAtCompany 0.128179768 0.043174647 2.969 0.002989
## YearsInCurrentRole -0.149498746 0.055918660 -2.674 0.007506
## YearsSinceLastPromotion 0.149768408 0.050488996 2.966 0.003014
## YearsWithCurrManager -0.151385099 0.057652148 -2.626 0.008644
##
## (Intercept) ***
## Age
## BusinessTravel ***
## Department ***
## DistanceFromHome **
## Education
## EducationField .
## EnvironmentSatisfaction ***
## Gender *
## JobInvolvement ***
## JobLevel .
## JobRole
## JobSatisfaction ***
## MaritalStatus **
## MonthlyIncome
## NumCompaniesWorked ***
## OverTime ***
## PercentSalaryHike
## PerformanceRating
## RelationshipSatisfaction .
## StockOptionLevel
## TotalWorkingYears *
## TrainingTimesLastYear .
## WorkLifeBalance ***
## YearsAtCompany **
## YearsInCurrentRole **
## YearsSinceLastPromotion **
## YearsWithCurrManager **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 879.09 on 954 degrees of freedom
## Residual deviance: 576.27 on 927 degrees of freedom
## AIC: 632.27
##
## Number of Fisher Scoring iterations: 6
We can test different thresholds, keeping in mind that, in theory, a smaller threshold improves recall. We try out 0.4, 0.5 and 0.6.
predicted <- predict(full.model, newdata = test.data, type = "response")
for (t in c(0.4, 0.5, 0.6)){
predicted.classes <- ifelse(predicted > t, 1, 0)
confusion.matrix <- table(predicted.classes, test.data$Attrition)
print(confusion.matrix)
compute_metrics(confusion.matrix)
}
##
## predicted.classes 0 1
## 0 243 27
## 1 26 22
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.8333333" "0.4583333" "0.4489796" "0.9033457" "0.4536082"
##
## predicted.classes 0 1
## 0 258 31
## 1 11 18
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.8679245" "0.6206897" "0.3673469" "0.9591078" "0.4615385"
##
## predicted.classes 0 1
## 0 264 33
## 1 5 16
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.8805031" "0.7619048" "0.3265306" "0.9814126" "0.4571429"
As we could have foreseen, a threshold of 0.4 best fits our purpose: this way, our model tends to classify more instances as 1. Nevertheless, we think these results can be further improved.
To assess collinearity in the model, we compute the Variance Inflation Factor (VIF), considering any variable with a VIF over 5 problematic. A high VIF indicates that a variable is strongly correlated with the other independent variables in the model, which can lead to unstable coefficient estimates and decreased interpretability.
library(car)  # provides vif()
vif(full.model)
## Age BusinessTravel Department
## 1.981978 1.114924 1.286799
## DistanceFromHome Education EducationField
## 1.084124 1.084458 1.061957
## EnvironmentSatisfaction Gender JobInvolvement
## 1.089613 1.061261 1.059996
## JobLevel JobRole JobSatisfaction
## 8.943373 1.287902 1.083157
## MaritalStatus MonthlyIncome NumCompaniesWorked
## 1.857679 8.121151 1.369443
## OverTime PercentSalaryHike PerformanceRating
## 1.202241 2.562508 2.552855
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 1.108647 1.822890 4.477195
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1.061533 1.106284 5.368819
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 2.756052 2.327484 2.749695
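As a reminder of what these numbers mean, the VIF of predictor j is 1 / (1 − R²_j), where R²_j comes from regressing that predictor on all the others. A quick self-contained check on the built-in mtcars data (standing in for our predictors):

```r
# VIF of disp computed by hand from an auxiliary regression on the other predictors
r2 <- summary(lm(disp ~ hp + wt, data = mtcars))$r.squared
vif_disp <- 1 / (1 - r2)
vif_disp  # grows as disp becomes more collinear with hp and wt
```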
It’s clear that only a few problematic collinear variables have VIF values over 5. We proceeded to eliminate them one at a time, ultimately producing the following model, from which the variables JobLevel and YearsAtCompany have been removed.
full.model.1 <- glm(train.data$Attrition ~ . - JobLevel - YearsAtCompany, data = train.data, family = binomial)
summary(full.model.1)
##
## Call:
## glm(formula = train.data$Attrition ~ . - JobLevel - YearsAtCompany,
## family = binomial, data = train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.3738502 1.4509513 3.704 0.000212 ***
## Age -0.0287792 0.0159886 -1.800 0.071863 .
## BusinessTravel 0.9459034 0.2107574 4.488 0.00000719 ***
## Department -0.6801298 0.2204704 -3.085 0.002036 **
## DistanceFromHome 0.0354978 0.0130315 2.724 0.006449 **
## Education -0.1245191 0.1053929 -1.181 0.237414
## EducationField 0.1379003 0.0753363 1.830 0.067181 .
## EnvironmentSatisfaction -0.4361269 0.1000806 -4.358 0.00001314 ***
## Gender -0.5756776 0.2262071 -2.545 0.010930 *
## JobInvolvement -0.5402624 0.1459430 -3.702 0.000214 ***
## JobRole 0.0811444 0.0531546 1.527 0.126867
## JobSatisfaction -0.3251737 0.0990701 -3.282 0.001030 **
## MaritalStatus -0.5289013 0.1976041 -2.677 0.007438 **
## MonthlyIncome -0.0001017 0.0000431 -2.360 0.018261 *
## NumCompaniesWorked 0.1628835 0.0472813 3.445 0.000571 ***
## OverTime 2.0655622 0.2304888 8.962 < 0.0000000000000002 ***
## PercentSalaryHike -0.0609106 0.0478049 -1.274 0.202611
## PerformanceRating 0.1682054 0.4885533 0.344 0.730626
## RelationshipSatisfaction -0.1716181 0.1019241 -1.684 0.092223 .
## StockOptionLevel -0.2185963 0.1736300 -1.259 0.208038
## TotalWorkingYears -0.0510323 0.0300978 -1.696 0.089971 .
## TrainingTimesLastYear -0.1519060 0.0897364 -1.693 0.090493 .
## WorkLifeBalance -0.5275943 0.1467449 -3.595 0.000324 ***
## YearsInCurrentRole -0.0932801 0.0531587 -1.755 0.079302 .
## YearsSinceLastPromotion 0.2010928 0.0481103 4.180 0.00002917 ***
## YearsWithCurrManager -0.0884568 0.0559007 -1.582 0.113560
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 879.09 on 954 degrees of freedom
## Residual deviance: 587.00 on 929 degrees of freedom
## AIC: 639
##
## Number of Fisher Scoring iterations: 6
We can see from the summary that this model still contains many non-significant variables (with high p-values), but we can further improve it using a feature selection approach: we tested both forward and backward algorithms, using the function step.
In forward feature selection, we start from a minimal model and progressively add one variable at a time as long as model performance improves. Conversely, the backward algorithm iteratively removes one variable at a time, starting from the complete model, in such a way that the AIC keeps decreasing. AIC stands for Akaike Information Criterion (AIC = 2k − 2 log L̂, where k is the number of estimated parameters and L̂ is the maximized likelihood) and quantifies the trade-off between goodness of fit and model complexity.
We observed that the backward selection algorithm provides a better-performing model with a lower AIC, so that is the algorithm we are going to use for the rest of the project.
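A sketch of such a backward run with step (here on the built-in mtcars data, standing in for our training set; suppressWarnings only silences possible convergence warnings on this tiny toy fit):

```r
# Backward stepwise selection by AIC with step()
toy.model <- suppressWarnings(glm(am ~ mpg + wt + hp + qsec,
                                  data = mtcars, family = binomial))
toy.step  <- suppressWarnings(step(toy.model, direction = "backward", trace = 0))
formula(toy.step)  # the reduced model retained by AIC
```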
summary(full.model.step)
##
## Call:
## glm(formula = train.data$Attrition ~ Age + BusinessTravel + Department +
## DistanceFromHome + EducationField + EnvironmentSatisfaction +
## Gender + JobInvolvement + JobRole + JobSatisfaction + MaritalStatus +
## MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike +
## RelationshipSatisfaction + TotalWorkingYears + TrainingTimesLastYear +
## WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager, family = binomial, data = train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.47779689 1.03771114 5.279 0.00000013
## Age -0.03141510 0.01587514 -1.979 0.047829
## BusinessTravel 0.93356313 0.20889928 4.469 0.00000786
## Department -0.68574034 0.21995900 -3.118 0.001823
## DistanceFromHome 0.03344059 0.01289453 2.593 0.009503
## EducationField 0.13571811 0.07513388 1.806 0.070864
## EnvironmentSatisfaction -0.41774453 0.09907358 -4.217 0.00002481
## Gender -0.56585144 0.22478896 -2.517 0.011827
## JobInvolvement -0.54840057 0.14499312 -3.782 0.000155
## JobRole 0.08456021 0.05316170 1.591 0.111694
## JobSatisfaction -0.32316086 0.09841699 -3.284 0.001025
## MaritalStatus -0.69416248 0.15259129 -4.549 0.00000539
## MonthlyIncome -0.00010191 0.00004299 -2.371 0.017757
## NumCompaniesWorked 0.15569052 0.04675489 3.330 0.000869
## OverTime 2.05679276 0.22919926 8.974 < 0.0000000000000002
## PercentSalaryHike -0.04826242 0.03028913 -1.593 0.111073
## RelationshipSatisfaction -0.16127034 0.10096136 -1.597 0.110188
## TotalWorkingYears -0.04941084 0.03003061 -1.645 0.099898
## TrainingTimesLastYear -0.15494378 0.08986337 -1.724 0.084669
## WorkLifeBalance -0.54247129 0.14618108 -3.711 0.000206
## YearsInCurrentRole -0.09313771 0.05284844 -1.762 0.078009
## YearsSinceLastPromotion 0.19775349 0.04783774 4.134 0.00003568
## YearsWithCurrManager -0.09085149 0.05570928 -1.631 0.102930
##
## (Intercept) ***
## Age *
## BusinessTravel ***
## Department **
## DistanceFromHome **
## EducationField .
## EnvironmentSatisfaction ***
## Gender *
## JobInvolvement ***
## JobRole
## JobSatisfaction **
## MaritalStatus ***
## MonthlyIncome *
## NumCompaniesWorked ***
## OverTime ***
## PercentSalaryHike
## RelationshipSatisfaction
## TotalWorkingYears .
## TrainingTimesLastYear .
## WorkLifeBalance ***
## YearsInCurrentRole .
## YearsSinceLastPromotion ***
## YearsWithCurrManager
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 879.09 on 954 degrees of freedom
## Residual deviance: 590.36 on 932 degrees of freedom
## AIC: 636.36
##
## Number of Fisher Scoring iterations: 6
The function step has produced a model that excludes the variables Education, PerformanceRating and StockOptionLevel, whereas the most significant features are:
In particular, OverTime has a significantly smaller p-value (<2e-16) compared to the other predictor variables.
From now on, for every model we will still try out different thresholds (0.4, 0.5 and 0.6), but for brevity we will directly report the best of the three. Here follow the confusion matrix and the corresponding metrics obtained by the model with a threshold of 0.4:
##
## predicted.classes.step 0 1
## 0 248 28
## 1 21 21
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.8459119" "0.5000000" "0.4285714" "0.9219331" "0.4615385"
# Data frame collecting the evaluation metrics of every model
df_metrics <- data.frame(Model_name = character(),
                         Accuracy = numeric(),
                         Precision = numeric(),
                         Recall = numeric(),
                         F1_score = numeric(),
                         PR_AUC = numeric())

Model_name <- "Logistic Regression Unbalanced"
accuracy <- as.numeric(metrics_lr_unb[2, which(metrics_lr_unb[1, ] == "Accuracy")])
precision <- as.numeric(metrics_lr_unb[2, which(metrics_lr_unb[1, ] == "Precision")])
recall <- as.numeric(metrics_lr_unb[2, which(metrics_lr_unb[1, ] == "Recall")])
specificity <- as.numeric(metrics_lr_unb[2, which(metrics_lr_unb[1, ] == "Specificity")])
f1_score <- as.numeric(metrics_lr_unb[2, which(metrics_lr_unb[1, ] == "F1 score")])

# Area under the precision-recall curve (pr.curve from the PRROC package)
class0_indices <- which(test.data$Attrition == 0)
pr_curve_glm_unb <- pr.curve(scores.class0 = predicted.classes.step[class0_indices],
                             scores.class1 = predicted.classes.step[-class0_indices],
                             curve = TRUE, max.compute = TRUE,
                             min.compute = TRUE, rand.compute = TRUE)
pr_auc_unb <- as.numeric(pr_curve_glm_unb$auc.integral)

new_row <- data.frame(Model_name = Model_name,
                      Accuracy = accuracy,
                      Precision = precision,
                      Recall = recall,
                      F1_score = f1_score,
                      PR_AUC = pr_auc_unb)
df_metrics <- rbind(df_metrics, new_row)
Since we are working with an unbalanced dataset, it’s recommended to employ sampling techniques to address the class imbalance issue. This approach will enable the model to capture patterns from the minority class, which is crucial, as our main objective is to accurately detect individuals who intend to leave. We’ll focus solely on balancing the training set using the ovun.sample function and the SMOTE method while maintaining the test set’s original ratio. Furthermore, we continue to take into account the considerations derived from the VIF values of the full model.
Keep in mind that we thoroughly examine every sampling technique only for logistic regression: since oversampling turned out to produce the best-performing models, we will restrict ourselves to the unbalanced and oversampled training sets for the rest of the models.
We initially opted for undersampling our data with the ovun.sample function, but eventually discarded this approach because it produced a training set even smaller than the test set. In the end, we chose to resample the training set while keeping its original size, so that the two classes become approximately balanced (p = 0.5).
under.train.data <- ovun.sample(Attrition ~ ., data = train.data, N = nrow(train.data), p = 0.5, seed=108)$data
The training data is now split in the following way:
table(under.train.data$Attrition)
##
## 0 1
## 480 475
Here follows the application of the Logistic Regression model on the new training set.
full.model.under <- glm(under.train.data$Attrition ~ . - JobLevel - YearsAtCompany, data = under.train.data, family = binomial)
summary(full.model.under)
##
## Call:
## glm(formula = under.train.data$Attrition ~ . - JobLevel - YearsAtCompany,
## family = binomial, data = under.train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.46010374 1.24111229 5.205 0.00000019390
## Age -0.01587964 0.01282487 -1.238 0.215645
## BusinessTravel 1.07513350 0.18140156 5.927 0.00000000309
## Department -0.42766584 0.18280250 -2.339 0.019310
## DistanceFromHome 0.02708938 0.01105806 2.450 0.014296
## Education -0.20376032 0.09285701 -2.194 0.028211
## EducationField 0.17634204 0.06125771 2.879 0.003993
## EnvironmentSatisfaction -0.31050304 0.07995386 -3.884 0.000103
## Gender -0.33431166 0.18882197 -1.771 0.076642
## JobInvolvement -0.49621082 0.11944054 -4.154 0.00003260582
## JobRole 0.03627219 0.04572943 0.793 0.427666
## JobSatisfaction -0.43443229 0.08065933 -5.386 0.00000007204
## MaritalStatus -0.66325200 0.16759086 -3.958 0.00007571717
## MonthlyIncome -0.00008609 0.00003380 -2.547 0.010860
## NumCompaniesWorked 0.12644992 0.03754910 3.368 0.000758
## OverTime 2.06213804 0.19421566 10.618 < 0.0000000000000002
## PercentSalaryHike -0.12916719 0.04077693 -3.168 0.001537
## PerformanceRating 0.51731295 0.41191643 1.256 0.209164
## RelationshipSatisfaction -0.16175836 0.08440645 -1.916 0.055311
## StockOptionLevel -0.02886215 0.14288159 -0.202 0.839916
## TotalWorkingYears -0.03596412 0.02321043 -1.549 0.121266
## TrainingTimesLastYear -0.25877777 0.08008123 -3.231 0.001232
## WorkLifeBalance -0.51773725 0.12215665 -4.238 0.00002252126
## YearsInCurrentRole -0.16933580 0.04634446 -3.654 0.000258
## YearsSinceLastPromotion 0.19339453 0.03945852 4.901 0.00000095248
## YearsWithCurrManager -0.05342510 0.04487224 -1.191 0.233809
##
## (Intercept) ***
## Age
## BusinessTravel ***
## Department *
## DistanceFromHome *
## Education *
## EducationField **
## EnvironmentSatisfaction ***
## Gender .
## JobInvolvement ***
## JobRole
## JobSatisfaction ***
## MaritalStatus ***
## MonthlyIncome *
## NumCompaniesWorked ***
## OverTime ***
## PercentSalaryHike **
## PerformanceRating
## RelationshipSatisfaction .
## StockOptionLevel
## TotalWorkingYears
## TrainingTimesLastYear **
## WorkLifeBalance ***
## YearsInCurrentRole ***
## YearsSinceLastPromotion ***
## YearsWithCurrManager
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1323.88 on 954 degrees of freedom
## Residual deviance: 824.66 on 929 degrees of freedom
## AIC: 876.66
##
## Number of Fisher Scoring iterations: 5
Notice that this model has a higher AIC than the previous models. The features significant at the 0.001 level (code '***') are the following:
BusinessTravel
EnvironmentSatisfaction
JobInvolvement
JobSatisfaction
MaritalStatus
NumCompaniesWorked
OverTime
WorkLifeBalance
YearsInCurrentRole
YearsSinceLastPromotion
##
## predicted.classes.under 0 1
## 0 177 9
## 1 92 40
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.6823899" "0.3030303" "0.8163265" "0.6579926" "0.4419890"
Evidently, the chosen threshold of 0.4 provides a reasonable compromise between recall and F1-score. According to this model, the two most significant features are OverTime and BusinessTravel.
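The threshold comparison we carry out for each model can be sketched as follows (with made-up probabilities and labels, purely for illustration):

```r
# Illustrative fitted probabilities and true labels (made up, not from the model)
probs  <- c(0.10, 0.35, 0.45, 0.55, 0.62, 0.80)
actual <- c(0, 0, 1, 0, 1, 1)

# Tabulate simple metrics for each candidate threshold
for (t in c(0.4, 0.5, 0.6)) {
  pred   <- ifelse(probs > t, 1, 0)
  recall <- sum(pred == 1 & actual == 1) / sum(actual == 1)
  cat("threshold =", t, " accuracy =", round(mean(pred == actual), 3),
      " recall =", round(recall, 3), "\n")
}
```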
As before, we perform feature selection with the step function and, as expected, the backward approach is preferred.
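The backward selection itself can be sketched with the step function on simulated data (a toy stand-in for full.model.under; the call we use on the real model is analogous):

```r
# Toy illustration of backward AIC-driven selection with step();
# x3 is deliberately pure noise and tends to be dropped
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
resp <- rbinom(n, 1, plogis(1.5 * x1 - x2))

full    <- glm(resp ~ x1 + x2 + x3, family = binomial)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)  # uninformative predictors tend to be removed
```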
summary(model.step.under)
##
## Call:
## glm(formula = under.train.data$Attrition ~ BusinessTravel + Department +
## DistanceFromHome + Education + EducationField + EnvironmentSatisfaction +
## Gender + JobInvolvement + JobSatisfaction + MaritalStatus +
## MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike +
## RelationshipSatisfaction + TotalWorkingYears + TrainingTimesLastYear +
## WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion,
## family = binomial, data = under.train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.25033727 0.87490609 8.287 < 0.0000000000000002
## BusinessTravel 1.06623630 0.17705730 6.022 0.00000000172
## Department -0.36490435 0.16731124 -2.181 0.029184
## DistanceFromHome 0.02403790 0.01093905 2.197 0.027989
## Education -0.22338595 0.09174487 -2.435 0.014898
## EducationField 0.18498218 0.05955102 3.106 0.001895
## EnvironmentSatisfaction -0.31382789 0.07929137 -3.958 0.00007560935
## Gender -0.30011663 0.18576233 -1.616 0.106182
## JobInvolvement -0.50971128 0.11870425 -4.294 0.00001755143
## JobSatisfaction -0.43759959 0.07966957 -5.493 0.00000003959
## MaritalStatus -0.67487960 0.12787788 -5.278 0.00000013094
## MonthlyIncome -0.00008727 0.00003300 -2.645 0.008172
## NumCompaniesWorked 0.11790024 0.03692251 3.193 0.001407
## OverTime 2.07575450 0.19268544 10.773 < 0.0000000000000002
## PercentSalaryHike -0.09162363 0.02492224 -3.676 0.000237
## RelationshipSatisfaction -0.16989380 0.08220426 -2.067 0.038760
## TotalWorkingYears -0.05127510 0.02040634 -2.513 0.011981
## TrainingTimesLastYear -0.27444513 0.07847603 -3.497 0.000470
## WorkLifeBalance -0.50123394 0.12016514 -4.171 0.00003029876
## YearsInCurrentRole -0.19962263 0.03687695 -5.413 0.00000006191
## YearsSinceLastPromotion 0.18319257 0.03894127 4.704 0.00000254701
##
## (Intercept) ***
## BusinessTravel ***
## Department *
## DistanceFromHome *
## Education *
## EducationField **
## EnvironmentSatisfaction ***
## Gender
## JobInvolvement ***
## JobSatisfaction ***
## MaritalStatus ***
## MonthlyIncome **
## NumCompaniesWorked **
## OverTime ***
## PercentSalaryHike ***
## RelationshipSatisfaction *
## TotalWorkingYears *
## TrainingTimesLastYear ***
## WorkLifeBalance ***
## YearsInCurrentRole ***
## YearsSinceLastPromotion ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1323.9 on 954 degrees of freedom
## Residual deviance: 829.9 on 934 degrees of freedom
## AIC: 871.9
##
## Number of Fisher Scoring iterations: 5
We can see that the step function has removed from model.step.under the features:
Age
JobRole
PerformanceRating
StockOptionLevel
YearsWithCurrManager
It yields the following results using a threshold of 0.4:
print(confusion.matrix.step.under)
##
## predicted.classes.step.under 0 1
## 0 174 9
## 1 95 40
compute_metrics(confusion.matrix.step.under)
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.6729560" "0.2962963" "0.8163265" "0.6468401" "0.4347826"
class0_indices = which(test.data$Attrition == 0)
pr_curve_glm <- pr.curve(scores.class0 = predicted.step.under[class0_indices], scores.class1 = predicted.step.under[-class0_indices], curve=TRUE, max.compute = T, min.compute = T, rand.compute = T)
plot(pr_curve_glm, max.plot = FALSE, min.plot = FALSE, rand.plot = FALSE, fill.area = FALSE, color = 2, auc.main = FALSE, ylim = c(0, 1))
lr_auc<- as.numeric(pr_curve_glm$auc.integral)
cat("Area Under the Curve (AUC):", lr_auc, "\n")
## Area Under the Curve (AUC): 0.7236698
text(0.6, 0.4, paste("AUC =", round(lr_auc, 4)), col = "black", cex = 1.2)
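The curve above pairs precision with recall across all thresholds; the computation can be sketched by hand as follows (with illustrative scores, not the model's predictions):

```r
# Hand-rolled precision/recall pairs across thresholds
# (illustrative scores and labels, not the model's output)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 0, 0)

pr_points <- t(sapply(sort(unique(scores)), function(thr) {
  pred <- as.numeric(scores >= thr)
  c(threshold = thr,
    precision = sum(pred == 1 & labels == 1) / sum(pred == 1),
    recall    = sum(pred == 1 & labels == 1) / sum(labels == 1))
}))
pr_points
```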
We continue to use the ovun.sample function, this time to oversample the training set: instances of the minority class are replicated until the class distribution becomes balanced.
over.train.data <- ovun.sample(Attrition ~ .,
data = train.data, p = 0.5, method = "over", seed=108)$data
table(over.train.data$Attrition)
##
## 0 1
## 790 790
The Logistic Regression model trained on the oversampled training set is the following:
summary(full.model.over)
##
## Call:
## glm(formula = over.train.data$Attrition ~ . - JobLevel - YearsAtCompany,
## family = binomial, data = over.train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.45749422 0.91876344 7.028 0.000000000002088
## Age -0.02286485 0.00990843 -2.308 0.021021
## BusinessTravel 0.96740757 0.13393824 7.223 0.000000000000509
## Department -0.56014447 0.14099949 -3.973 0.000071071439327
## DistanceFromHome 0.03328940 0.00851547 3.909 0.000092570434441
## Education -0.17558726 0.06751774 -2.601 0.009306
## EducationField 0.15059333 0.04622681 3.258 0.001123
## EnvironmentSatisfaction -0.36734732 0.06093993 -6.028 0.000000001659772
## Gender -0.49374229 0.14279706 -3.458 0.000545
## JobInvolvement -0.54683399 0.09407276 -5.813 0.000000006140584
## JobRole 0.08545792 0.03457613 2.472 0.013451
## JobSatisfaction -0.34581483 0.06185903 -5.590 0.000000022658734
## MaritalStatus -0.66847192 0.12216833 -5.472 0.000000044566777
## MonthlyIncome -0.00011858 0.00002562 -4.628 0.000003687267979
## NumCompaniesWorked 0.14405261 0.02958525 4.869 0.000001121252301
## OverTime 1.96464365 0.14403286 13.640 < 0.0000000000000002
## PercentSalaryHike -0.06763351 0.02988017 -2.263 0.023605
## PerformanceRating 0.25723230 0.30833348 0.834 0.404131
## RelationshipSatisfaction -0.13572620 0.06298315 -2.155 0.031165
## StockOptionLevel -0.13767017 0.10450264 -1.317 0.187710
## TotalWorkingYears -0.01980694 0.01794954 -1.103 0.269819
## TrainingTimesLastYear -0.20833565 0.05788950 -3.599 0.000320
## WorkLifeBalance -0.53857485 0.09065867 -5.941 0.000000002838292
## YearsInCurrentRole -0.08302401 0.03376607 -2.459 0.013940
## YearsSinceLastPromotion 0.20436861 0.02931361 6.972 0.000000000003129
## YearsWithCurrManager -0.10446339 0.03281303 -3.184 0.001455
##
## (Intercept) ***
## Age *
## BusinessTravel ***
## Department ***
## DistanceFromHome ***
## Education **
## EducationField **
## EnvironmentSatisfaction ***
## Gender ***
## JobInvolvement ***
## JobRole *
## JobSatisfaction ***
## MaritalStatus ***
## MonthlyIncome ***
## NumCompaniesWorked ***
## OverTime ***
## PercentSalaryHike *
## PerformanceRating
## RelationshipSatisfaction *
## StockOptionLevel
## TotalWorkingYears
## TrainingTimesLastYear ***
## WorkLifeBalance ***
## YearsInCurrentRole *
## YearsSinceLastPromotion ***
## YearsWithCurrManager **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2190.3 on 1579 degrees of freedom
## Residual deviance: 1420.9 on 1554 degrees of freedom
## AIC: 1472.9
##
## Number of Fisher Scoring iterations: 5
We can notice that the only variables with a p-value larger than 0.1 are PerformanceRating, StockOptionLevel and TotalWorkingYears.
These are the results:
##
## predicted.classes.over 0 1
## 0 190 12
## 1 79 37
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7138365" "0.3189655" "0.7551020" "0.7063197" "0.4484848"
The threshold that maximizes recall is 0.4, which results in only 12 leaving employees being misclassified, with minimal impact on the F1 score. We will proceed with the backward selection using this threshold.
summary(model.step.over)
##
## Call:
## glm(formula = over.train.data$Attrition ~ Age + BusinessTravel +
## Department + DistanceFromHome + Education + EducationField +
## EnvironmentSatisfaction + Gender + JobInvolvement + JobRole +
## JobSatisfaction + MaritalStatus + MonthlyIncome + NumCompaniesWorked +
## OverTime + PercentSalaryHike + RelationshipSatisfaction +
## TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole +
## YearsSinceLastPromotion + YearsWithCurrManager, family = binomial,
## data = over.train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.13574258 0.66537605 10.724 < 0.0000000000000002
## Age -0.02808470 0.00846169 -3.319 0.000903
## BusinessTravel 0.96695752 0.13348068 7.244 0.000000000000435076
## Department -0.56800665 0.13981499 -4.063 0.000048537675384434
## DistanceFromHome 0.03240924 0.00845645 3.832 0.000127
## Education -0.18294608 0.06732461 -2.717 0.006580
## EducationField 0.15298493 0.04612509 3.317 0.000911
## EnvironmentSatisfaction -0.36109862 0.06049928 -5.969 0.000000002392350087
## Gender -0.49626879 0.14166218 -3.503 0.000460
## JobInvolvement -0.54956292 0.09363125 -5.869 0.000000004372716457
## JobRole 0.08467057 0.03461326 2.446 0.014438
## JobSatisfaction -0.34676045 0.06168309 -5.622 0.000000018914706532
## MaritalStatus -0.77277184 0.09437820 -8.188 0.000000000000000266
## MonthlyIncome -0.00013414 0.00002134 -6.285 0.000000000327800097
## NumCompaniesWorked 0.13457742 0.02907479 4.629 0.000003680336249982
## OverTime 1.95492858 0.14342617 13.630 < 0.0000000000000002
## PercentSalaryHike -0.04718469 0.01869251 -2.524 0.011594
## RelationshipSatisfaction -0.12241701 0.06241537 -1.961 0.049841
## TrainingTimesLastYear -0.21132609 0.05750318 -3.675 0.000238
## WorkLifeBalance -0.55085496 0.08948812 -6.156 0.000000000747839158
## YearsInCurrentRole -0.08599768 0.03332432 -2.581 0.009862
## YearsSinceLastPromotion 0.19984752 0.02877159 6.946 0.000000000003757839
## YearsWithCurrManager -0.11173822 0.03242519 -3.446 0.000569
##
## (Intercept) ***
## Age ***
## BusinessTravel ***
## Department ***
## DistanceFromHome ***
## Education **
## EducationField ***
## EnvironmentSatisfaction ***
## Gender ***
## JobInvolvement ***
## JobRole *
## JobSatisfaction ***
## MaritalStatus ***
## MonthlyIncome ***
## NumCompaniesWorked ***
## OverTime ***
## PercentSalaryHike *
## RelationshipSatisfaction *
## TrainingTimesLastYear ***
## WorkLifeBalance ***
## YearsInCurrentRole **
## YearsSinceLastPromotion ***
## YearsWithCurrManager ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2190.3 on 1579 degrees of freedom
## Residual deviance: 1424.5 on 1557 degrees of freedom
## AIC: 1470.5
##
## Number of Fisher Scoring iterations: 5
The algorithm removed exactly the variables just mentioned for their high p-values: PerformanceRating, StockOptionLevel and TotalWorkingYears.
##
## predicted.classes.step.over 0 1
## 0 187 12
## 1 82 37
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7044025" "0.3109244" "0.7551020" "0.6951673" "0.4404762"
SMOTE (Synthetic Minority Over-sampling Technique) is a data augmentation technique that generates synthetic examples for the minority class by interpolating between existing minority instances and their nearest neighbors.
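The core of the technique can be sketched as a single interpolation step (with made-up feature values; the actual smote implementation repeats this across the k nearest minority neighbors):

```r
# One SMOTE interpolation step, sketched for a single minority instance
# (illustrative values, not taken from HR_Analytics)
set.seed(108)
x        <- c(Age = 30, MonthlyIncome = 2500)  # a minority-class instance
neighbor <- c(Age = 34, MonthlyIncome = 3100)  # one of its nearest minority neighbors

gap <- runif(1)                        # random position along the segment
synthetic <- x + gap * (neighbor - x)  # new synthetic minority example
synthetic
```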
train.data$Attrition = as.factor(train.data$Attrition)
smote.train.data <- smote(Attrition ~ ., train.data, perc.over = 2, k = 5)
table(smote.train.data$Attrition)
##
## 0 1
## 660 495
The model trained on the SMOTEd training set is:
full.model.smote <- glm(smote.train.data$Attrition ~ . - JobLevel - YearsAtCompany, data = smote.train.data, family = binomial)
summary(full.model.smote)
##
## Call:
## glm(formula = smote.train.data$Attrition ~ . - JobLevel - YearsAtCompany,
## family = binomial, data = smote.train.data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.98817675 1.17017194 6.826 0.0000000000087
## Age -0.02380278 0.01318624 -1.805 0.07106
## BusinessTravel 0.99400688 0.16883374 5.887 0.0000000039211
## Department -0.55772607 0.18027384 -3.094 0.00198
## DistanceFromHome 0.01666558 0.01039088 1.604 0.10874
## Education -0.13102640 0.08596606 -1.524 0.12747
## EducationField 0.23907772 0.06126245 3.903 0.0000951977891
## EnvironmentSatisfaction -0.41590030 0.07768143 -5.354 0.0000000860680
## Gender -0.72313274 0.17584468 -4.112 0.0000391672573
## JobInvolvement -0.71816512 0.12200275 -5.886 0.0000000039454
## JobRole 0.00273708 0.04392030 0.062 0.95031
## JobSatisfaction -0.38756096 0.07911320 -4.899 0.0000009641631
## MaritalStatus -0.63382441 0.16137871 -3.928 0.0000858123398
## MonthlyIncome -0.00007128 0.00003290 -2.167 0.03025
## NumCompaniesWorked 0.15884842 0.03872909 4.102 0.0000410433016
## OverTime 2.14256801 0.18107257 11.833 < 0.0000000000000002
## PercentSalaryHike -0.06611594 0.03849501 -1.718 0.08588
## PerformanceRating -0.00834256 0.38744912 -0.022 0.98282
## RelationshipSatisfaction -0.19309669 0.07925203 -2.436 0.01483
## StockOptionLevel -0.16553031 0.14071263 -1.176 0.23945
## TotalWorkingYears -0.06156952 0.02390990 -2.575 0.01002
## TrainingTimesLastYear -0.04272155 0.07287277 -0.586 0.55771
## WorkLifeBalance -0.72136300 0.11619370 -6.208 0.0000000005357
## YearsInCurrentRole -0.06124557 0.04689500 -1.306 0.19155
## YearsSinceLastPromotion 0.18391209 0.03766118 4.883 0.0000010430715
## YearsWithCurrManager -0.11015197 0.04666833 -2.360 0.01826
##
## (Intercept) ***
## Age .
## BusinessTravel ***
## Department **
## DistanceFromHome
## Education
## EducationField ***
## EnvironmentSatisfaction ***
## Gender ***
## JobInvolvement ***
## JobRole
## JobSatisfaction ***
## MaritalStatus ***
## MonthlyIncome *
## NumCompaniesWorked ***
## OverTime ***
## PercentSalaryHike .
## PerformanceRating
## RelationshipSatisfaction *
## StockOptionLevel
## TotalWorkingYears *
## TrainingTimesLastYear
## WorkLifeBalance ***
## YearsInCurrentRole
## YearsSinceLastPromotion ***
## YearsWithCurrManager *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1577.52 on 1154 degrees of freedom
## Residual deviance: 978.29 on 1129 degrees of freedom
## AIC: 1030.3
##
## Number of Fisher Scoring iterations: 5
As is evident from the summary, many features are significant; the model yields the following results:
predicted.smote <- predict(full.model.smote, newdata = test.data, type = "response")
predicted.classes.smote <- ifelse(predicted.smote > 0.4, 1, 0)
confusion_matrix.s <- table(predicted.classes.smote, test.data$Attrition)
print(confusion_matrix.s)
##
## predicted.classes.smote 0 1
## 0 201 12
## 1 68 37
compute_metrics(confusion_matrix.s)
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7484277" "0.3523810" "0.7551020" "0.7472119" "0.4805195"
The model with threshold 0.4 reaches a 75% recall with an F1-score of 0.48. The backward algorithm is able to reduce the AIC from 1444 to 1023, removing the columns StockOptionLevel, PerformanceRating, TotalWorkingYears, TrainingTimesLastYear and YearsInCurrentRole. However, the performance decreases ever so slightly, so we do not report that model.
Instead of managing model complexity through subset selection methods, we can fit a model with all the predictors using techniques that constrain the coefficient estimates. In our case, we have chosen to apply Ridge and Lasso regression: Ridge applies a quadratic penalty, while Lasso applies an absolute-value penalty.
Ridge regularization, also known as L2 regularization, adds a penalty term to the loss function, which corresponds to the sum of the squared values of the model’s coefficients. This penalty term is essentially the L2 norm of the coefficients and is represented as:
\(\lambda\sum_{j=1}^{p}\beta_j^2\)
This regularization method encourages the model to keep the coefficients of the variables small, pushing them towards zero.
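To see this effect concretely, the sketch below computes the closed-form ridge solution on simulated data (a purely illustrative linear example; for our logistic model, glmnet penalizes the likelihood analogously). The squared norm of the coefficient vector shrinks as \(\lambda\) grows.

```r
# Shrinkage illustrated with the closed-form linear-ridge solution
# beta(lambda) = (X'X + lambda * I)^{-1} X'y, on simulated data
set.seed(42)
n <- 100; p <- 3
Xs <- matrix(rnorm(n * p), n, p)
ys <- Xs %*% c(2, -1, 0.5) + rnorm(n)

ridge_beta <- function(lambda) solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, ys))

# Coefficients for increasing penalties: they are pulled toward zero
round(cbind(lambda_0 = ridge_beta(0), lambda_10 = ridge_beta(10),
            lambda_100 = ridge_beta(100)), 3)
```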
The shrinkage term is controlled by the hyperparameter \(\lambda\), which we will select using cross-validation. We will consider a range of possible values for \(\lambda\) from 100,000 to 0.001.
grid <- 10^seq(5, -3, length=100)
We define the model ridge.mod specifying the grid of lambda values for cross-validation.
X <- model.matrix(Attrition ~ ., data = HR_Analytics)
X <- X[,-1]
y <- HR_Analytics$Attrition
y.test <- y[test]
ridge.mod <- cv.glmnet(X[train,], y[train], alpha = 0,
lambda = grid, thresh = 1e-12, standardize = TRUE, nfold = 10)
bestlam <- ridge.mod$lambda.min
These are the results of unbalanced Ridge Regression using the optimal lambda value:
bestlam
## [1] 0.04132012
ridge.pred <- predict(ridge.mod, s = bestlam, newx = X[test, ])
ridge.pred_class <- ifelse(ridge.pred > 0.3, 1, 0)
ridge_confusion_mat <- table(ridge.pred_class, y[test])
print(ridge_confusion_mat)
##
## ridge.pred_class 0 1
## 0 224 18
## 1 45 31
ridge.metrics <- compute_metrics(ridge_confusion_mat)
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.8018868" "0.4078947" "0.6326531" "0.8327138" "0.4960000"
From the plot and the coefficients printed below, it is evident that there is a particularly significant feature: OverTime. Other important variables are BusinessTravel, MaritalStatus, WorkLifeBalance and EnvironmentSatisfaction.
ridge.mod_un <- glmnet(X, y, alpha=0, lambda = grid, thresh = 1e-12, standardize = TRUE)
plot(ridge.mod_un, xvar="lambda", label=TRUE)
selected_coeffs <- coefficients(ridge.mod_un, s = bestlam)
selected_coeffs
## 28 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.815718444792
## Age -0.003105269867
## BusinessTravel 0.070970813263
## Department -0.064478371300
## DistanceFromHome 0.003352160742
## Education -0.003289655360
## EducationField 0.014208466222
## EnvironmentSatisfaction -0.041613003190
## Gender -0.044662639998
## JobInvolvement -0.061572386841
## JobLevel -0.022114823279
## JobRole 0.010732371093
## JobSatisfaction -0.030713194735
## MaritalStatus -0.053483541964
## MonthlyIncome -0.000003928772
## NumCompaniesWorked 0.012627405968
## OverTime 0.187414642316
## PercentSalaryHike -0.003307936985
## PerformanceRating 0.020944815172
## RelationshipSatisfaction -0.014681473209
## StockOptionLevel -0.021887701815
## TotalWorkingYears -0.003033177018
## TrainingTimesLastYear -0.010661248657
## WorkLifeBalance -0.032025598637
## YearsAtCompany 0.001492645195
## YearsInCurrentRole -0.006724181442
## YearsSinceLastPromotion 0.012412286374
## YearsWithCurrManager -0.008176316332
X1 <- model.matrix(Attrition ~ . , data = over.train.data)
X1 <- X1[,-1]
y <- as.numeric(over.train.data$Attrition)
X2 <- model.matrix(Attrition~ . , data = test.data)
X2 <- X2[,-1]
y.test <- test.data$Attrition
plot(ridge.mod)
## [1] 0.01353048
##
## ridge.pred_class.over 0 1
## 0 182 10
## 1 87 39
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.6949686" "0.3095238" "0.7959184" "0.6765799" "0.4457143"
By looking at the plot below, it is clear that two features strongly dominate the others (which may explain the substantial improvement over the unbalanced Ridge model): 15 and 2, i.e. OverTime and BusinessTravel, followed by MaritalStatus, Gender, WorkLifeBalance and Department.
ridge.mod.ov <- glmnet(X1, y, alpha=0, lambda = grid, thresh = 1e-12, standardize = TRUE)
plot(ridge.mod.ov, xvar="lambda", label=TRUE)
selected_coeffs <- coefficients(ridge.mod.ov, s = bestlam)
selected_coeffs
## 28 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.504527005699
## Age -0.003176634840
## BusinessTravel 0.130021300151
## Department -0.086116496383
## DistanceFromHome 0.004306641853
## Education -0.021532507266
## EducationField 0.025528005502
## EnvironmentSatisfaction -0.049952831106
## Gender -0.062594796299
## JobInvolvement -0.077458789308
## JobLevel -0.052918108298
## JobRole 0.009872560862
## JobSatisfaction -0.052639285692
## MaritalStatus -0.082034709056
## MonthlyIncome -0.000006020989
## NumCompaniesWorked 0.022832081042
## OverTime 0.312540810420
## PercentSalaryHike -0.008242457602
## PerformanceRating 0.015645747225
## RelationshipSatisfaction -0.020579141966
## StockOptionLevel -0.030097795317
## TotalWorkingYears -0.005698733716
## TrainingTimesLastYear -0.029367835454
## WorkLifeBalance -0.079806078765
## YearsAtCompany 0.014265502589
## YearsInCurrentRole -0.018531128044
## YearsSinceLastPromotion 0.016854007204
## YearsWithCurrManager -0.019045082590
## Area Under the Curve (AUC): 0.7199852
Lasso regularization, also known as L1 regularization, adds a penalty term that consists of the sum of the absolute values of the model’s coefficients:
\(\lambda\sum_{j=1}^{p}|\beta_j|\)
This encourages the model's coefficients to become smaller, and some of them may become exactly zero. In fact, Lasso is a sparsity-inducing approach: the variables whose coefficients remain nonzero can be regarded as the most relevant.
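The mechanism can be sketched with the soft-thresholding operator, which lasso applies coordinate-wise in the orthonormal-design case: coefficients with magnitude below \(\lambda\) are set exactly to zero.

```r
# Soft-thresholding operator: sign(z) * max(|z| - lambda, 0)
# (exact for an orthonormal design; coefficients below are illustrative)
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

beta_ols <- c(0.9, -0.3, 0.05, -0.02)
soft_threshold(beta_ols, lambda = 0.1)  # the two small coefficients become exactly 0
```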
Similar to Ridge regularization, the penalty term is controlled by a hyperparameter \(\lambda\), which is selected through cross-validation. The search for the optimal \(\lambda\) involves exploring a grid of possible values ranging from 1000 to 0.0001.
grid <- 10^seq(3, -4, length=100)
Let’s define the model lasso.mod, providing the grid of values over which the function cv.glmnet performs cross-validation. This allows us to retrieve the \(\lambda\) that yields the best result.
X <- model.matrix(Attrition ~ . , data = HR_Analytics)
X <- X[,-1]
y <- HR_Analytics$Attrition
y.test <- y[test]
lasso.mod <- cv.glmnet(X[train, ], y[train], alpha = 1,
lambda = grid, thresh = 1e-12, standardize = TRUE, nfold = 10)
bestlam <- lasso.mod$lambda.min
plot(lasso.mod)
We can now let our model predict using the optimal \(\lambda\) obtained and a threshold of 0.3.
lasso.predict <- predict(lasso.mod, s = bestlam, newx = X[test, ])
lasso.predict_class <- ifelse(lasso.predict > 0.3, 1, 0)
lasso_confusion_mat <- table(lasso.predict_class, y[test])
lasso_confusion_mat
##
## lasso.predict_class 0 1
## 0 220 18
## 1 49 31
lasso.metrics <- compute_metrics(lasso_confusion_mat)
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7893082" "0.3875000" "0.6326531" "0.8178439" "0.4806202"
The plot gives a clear representation of how the variable OverTime, in this model as well, proves to be the most meaningful. Other noteworthy predictors are BusinessTravel, Department and JobInvolvement.
lasso.mod.full <- glmnet(X, y, alpha=1, lambda = grid, thresh = 1e-12, standardize = TRUE)
plot(lasso.mod.full, xvar="lambda", label=TRUE)
selected_coeffs <- coefficients(lasso.mod, s = bestlam)
selected_coeffs
## 28 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.966102060
## Age -0.003332553
## BusinessTravel 0.086191823
## Department -0.081402434
## DistanceFromHome 0.002877408
## Education -0.011615219
## EducationField 0.015104275
## EnvironmentSatisfaction -0.041219869
## Gender -0.054372346
## JobInvolvement -0.061638026
## JobLevel -0.045122365
## JobRole 0.012206188
## JobSatisfaction -0.028840010
## MaritalStatus -0.051117877
## MonthlyIncome .
## NumCompaniesWorked 0.015980572
## OverTime 0.235416118
## PercentSalaryHike -0.004795753
## PerformanceRating 0.006475347
## RelationshipSatisfaction -0.013226189
## StockOptionLevel -0.024477219
## TotalWorkingYears -0.003471853
## TrainingTimesLastYear -0.010878074
## WorkLifeBalance -0.052855235
## YearsAtCompany 0.005366176
## YearsInCurrentRole -0.009372052
## YearsSinceLastPromotion 0.011223508
## YearsWithCurrManager -0.009732903
Let’s see how Lasso Regularization performs using the oversampled training set.
As before, thanks to cross-validation, we can retrieve the best lambda:
X1 <- model.matrix(Attrition ~ . - JobLevel - YearsAtCompany, data = over.train.data)
X1 <- X1[,-1]
y <- as.numeric(over.train.data$Attrition)
X2 <- model.matrix(Attrition ~ . - JobLevel - YearsAtCompany, data = test.data)
X2 <- X2[,-1]
y.test <- test.data$Attrition
lasso.mod <- cv.glmnet(X1, y, alpha=1, lambda=grid, standardize=TRUE, nfold = 10)
plot(lasso.mod, label=TRUE)
bestlam.lasso.over <- lasso.mod$lambda.min
## [1] 0.001149757
lasso.pred.over <- predict(lasso.mod, s=bestlam.lasso.over, newx=X2, type="response")
lasso.over <- ifelse(lasso.pred.over > 0.44, 1, 0)
lasso_confusion_mat.over <- table(lasso.over, test.data$Attrition)
lasso_confusion_mat.over
##
## lasso.over 0 1
## 0 178 9
## 1 91 40
metrics.lasso <- compute_metrics(lasso_confusion_mat.over)
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.6855346" "0.3053435" "0.8163265" "0.6617100" "0.4444444"
As the plot and the coefficients below show, the variable OverTime once again contributes the most to the prediction.
lasso.mod.over.full <- glmnet(X1, y, alpha=1, lambda = grid, thresh = 1e-12, standardize = TRUE)
plot(lasso.mod.over.full, xvar="lambda", label=TRUE)
selected_coeffs <- coefficients(lasso.mod, s = bestlam)
selected_coeffs
## 26 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.51227971311
## Age -0.00376031306
## BusinessTravel 0.13945122689
## Department -0.07992339425
## DistanceFromHome 0.00448770921
## Education -0.02216796756
## EducationField 0.02610664899
## EnvironmentSatisfaction -0.05476229299
## Gender -0.06616115435
## JobInvolvement -0.08494680847
## JobRole 0.01239492544
## JobSatisfaction -0.05159400366
## MaritalStatus -0.09181812427
## MonthlyIncome -0.00001698944
## NumCompaniesWorked 0.02109520326
## OverTime 0.33351819309
## PercentSalaryHike -0.00879581719
## PerformanceRating 0.02056910279
## RelationshipSatisfaction -0.02048049377
## StockOptionLevel -0.02564077293
## TotalWorkingYears -0.00308224581
## TrainingTimesLastYear -0.02968206648
## WorkLifeBalance -0.08637608177
## YearsInCurrentRole -0.01147142270
## YearsSinceLastPromotion 0.02704394225
## YearsWithCurrManager -0.01500106023
The Naive Bayes model is a classification model based on Bayes’ theorem. It assumes that the predictors are conditionally independent given the class, hence the ‘naive’ designation. However, as observed in the exploratory data analysis (EDA), this independence condition is not satisfied; we proceed with the model anyway.
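Under this assumption the posterior factorizes as \(P(y \mid x) \propto P(y)\prod_{j}P(x_j \mid y)\). A minimal sketch with made-up probabilities (not estimated from our data):

```r
# Naive Bayes posterior: prior times product of per-feature likelihoods.
# All probabilities below are invented for illustration, NOT estimated from HR_Analytics
prior      <- c(stay = 0.84, leave = 0.16)
p_overtime <- c(stay = 0.23, leave = 0.54)  # P(OverTime = Yes | class)
p_single   <- c(stay = 0.28, leave = 0.51)  # P(MaritalStatus = Single | class)

post <- prior * p_overtime * p_single  # unnormalized posterior for this employee
post / sum(post)                       # normalized class probabilities
```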
Here follows the Naive Bayes model applied to the unbalanced training set, using a threshold of 0.4.
train.data$Attrition <- as.factor(train.data$Attrition)
naivebayes.model <- naive_bayes(Attrition ~ . - JobLevel - YearsAtCompany, data = train.data)
naivebayes.prediction <- predict(naivebayes.model, newdata = test.data, type = "prob")
naivebayes.y_pred_class <- ifelse(naivebayes.prediction[, "1"] > 0.4, 1, 0)
naivebayesmat <- table(naivebayes.y_pred_class, test.data$Attrition)
naivebayesmat
##
## naivebayes.y_pred_class 0 1
## 0 191 16
## 1 78 33
naivebayes.metrics <- compute_metrics(naivebayesmat)
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7044025" "0.2972973" "0.6734694" "0.7100372" "0.4125000"
Using the oversampled training set with the same threshold, the recall improves by roughly 10 percentage points. This gain, however, comes at the cost of precision, so the overall F1-score drops considerably.
##
## naivebayes.pred_over 0 1
## 0 132 11
## 1 137 38
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.5345912" "0.2171429" "0.7755102" "0.4907063" "0.3392857"
Linear Discriminant Analysis is a classification model that also aims at detecting a subspace that maximizes the separation between the classes while minimizing the variation within each class, so that our data lays in a subspace where they are well-separated.
LDA is based on two assumptions:
Each class is drawn from a normal distribution
All classes share the same covariance matrix.
Let’s check whether our features are normally distributed by defining the following function, which lets us visually assess the normality of the dataset columns through density and QQ plots. It also runs the Shapiro-Wilk test, which assesses normality numerically.
check.for.normality <- function(i) {
  # Take the column index so the plots can be labeled with the column name
  column <- HR_Analytics[, i]
  par(mfrow = c(1, 2))
  plot(density(column), main = paste("Density Plot of", colnames(HR_Analytics)[i]),
       xlab = colnames(HR_Analytics)[i])
  qqnorm(y = column, main = paste("QQ plot of", colnames(HR_Analytics)[i]))
  qqline(y = column)
  par(mfrow = c(1, 1))
  print(paste("Shapiro-Wilk Test", colnames(HR_Analytics)[i]))
  print(shapiro.test(column))
}
For brevity, we are only going to show the analysis of two variables: except for the variable Age, the Shapiro-Wilk test always results in a p-value smaller than 2.2e-16. We can conclude that our variables don’t have a normal distribution, but we are going to proceed with the application of LDA regardless, expecting that the model performance is affected by the fact that the two assumptions are not met.
## [1] "Shapiro-Wilk Test Age"
##
## Shapiro-Wilk normality test
##
## data: column
## W = 0.98063, p-value = 0.000000000004959
## [1] "Shapiro-Wilk Test Education"
##
## Shapiro-Wilk normality test
##
## data: column
## W = 0.89516, p-value < 0.00000000000000022
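For contrast, the test behaves as expected on samples that are genuinely Gaussian or clearly not (simulated data, not dataset columns):

```r
set.seed(108)
x_norm <- rnorm(500)  # genuinely normal sample
x_exp  <- rexp(500)   # strongly right-skewed sample
shapiro.test(x_norm)$p.value  # typically well above 0.05: normality not rejected
shapiro.test(x_exp)$p.value   # far below any usual significance level
```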
Here follows the application of LDA to our unbalanced training set. We are trying out different thresholds to determine which one improves the model performance, choosing from the usual range of values 0.4, 0.5 and 0.6.
lda.fit <- lda(Attrition ~ . - JobLevel - YearsAtCompany, data = train.data)
lda.pred <- predict(lda.fit, test.data, type="response")
lda.res <- lda.pred$posterior
for (thr in c(0.4, 0.5, 0.6)) {
  lda.pred.best <- as.factor(ifelse(lda.res[, 2] > thr, 1, 0))
  lda.conf.mat <- t(table(test.data$Attrition, lda.pred.best))
  print(lda.conf.mat)
  # Explicit print: values are not auto-printed inside a loop
  print(compute_metrics(lda.conf.mat))
  lda.error <- mean(test.data$Attrition != lda.pred.best)
  print(lda.error)
}
##
## lda.pred.best 0 1
## 0 246 24
## 1 23 25
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.8522013" "0.5208333" "0.5102041" "0.9144981" "0.5154639"
## [1] 0.1477987
##
## lda.pred.best 0 1
## 0 258 29
## 1 11 20
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.8742138" "0.6451613" "0.4081633" "0.9591078" "0.5000000"
## [1] 0.1257862
##
## lda.pred.best 0 1
## 0 261 33
## 1 8 16
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.8710692" "0.6666667" "0.3265306" "0.9702602" "0.4383562"
## [1] 0.1289308
Evidently, the threshold 0.4 induces a significantly higher recall than the other two, without excessively penalizing the F1-score. Although the threshold 0.5 attains the lowest mean error, recall is the metric we care most about here, so we conclude that the model with threshold 0.4 has the best performance.
To keep it brief, we report below only a few coefficients to show that the main direction - the one with the highest magnitude - is OverTime. It is the variable that best separates the data, even though it clearly does not do so sufficiently, as we expected given the non-linear separability of our data. This can be seen through LDA: the densities of the two classes largely overlap.
## [1] "NumCompaniesWorked 0.0952166695814974"
## [1] "OverTime 1.43402644528933"
## [1] "PercentSalaryHike -0.0394597894704209"
## [1] "PerformanceRating 0.153788365467926"
The results of applying the LDA model to the oversampled training data show a marked improvement in recall, at the cost of accuracy and precision.
lda.fit <- lda(Attrition ~ . - JobLevel - YearsAtCompany, data = over.train.data)
lda.pred_over <- predict(lda.fit, test.data, type="response")
lda.res_over <- lda.pred_over$posterior
lda.pred.best <- as.factor(ifelse(lda.res_over[,2] > 0.415, 1, 0))
lda.conf.mat <- t(table(test.data$Attrition, lda.pred.best))
print(lda.conf.mat)
##
## lda.pred.best 0 1
## 0 182 10
## 1 87 39
metrics.lda=compute_metrics(lda.conf.mat)
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.6949686" "0.3095238" "0.7959184" "0.6765799" "0.4457143"
lda.error <- mean(test.data$Attrition != lda.pred.best)
print(lda.error)
## [1] 0.3050314
plot(lda.fit)
Quadratic Discriminant Analysis is a classification model that exploits a quadratic decision boundary to separate the two classes of data. It is based on two main assumptions:
Each class is drawn from a normal distribution
Each class has a different covariance matrix.
The first assumption is shared with the LDA model: we have already shown that the dataset does not follow a normal distribution. However, as with LDA, we evaluate this model anyway, aware of its limitations due to the unmet assumptions.
We start with the unbalanced dataset. It yields the same results regardless of the threshold used.
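The covariance assumption can be probed informally by comparing class-wise covariance matrices. A minimal sketch with simulated two-class data (hypothetical, not our dataset):

```r
set.seed(108)
# Simulated two-class data with visibly different spreads
stay  <- matrix(rnorm(200, sd = 1), ncol = 2)  # class 0: tight cloud
leave <- matrix(rnorm(200, sd = 3), ncol = 2)  # class 1: wide cloud
# LDA assumes these two matrices are equal; QDA lets them differ
round(diag(cov(stay)), 2)   # variances near 1
round(diag(cov(leave)), 2)  # variances near 9
```

When the per-class covariance matrices differ this much, QDA's extra flexibility is justified; when they are similar, LDA's pooled estimate is usually the more stable choice.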
qda.fit <- qda(Attrition ~ . - JobLevel - YearsAtCompany, data = train.data)
qda.pred <- predict(qda.fit, test.data, type="response")
qda.res <- qda.pred$posterior
qda.pred.best <- as.factor(ifelse(qda.res[,2] > 0.4, 1, 0))
qda.conf.mat <- t(table(test.data$Attrition, qda.pred.best))
print(qda.conf.mat)
##
## qda.pred.best 0 1
## 0 232 30
## 1 37 19
compute_metrics(qda.conf.mat)
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7893082" "0.3392857" "0.3877551" "0.8624535" "0.3619048"
qda.error <- mean(test.data$Attrition != qda.pred.best)
print(qda.error)
## [1] 0.2106918
We obtain a 21% mean error. Regardless of the threshold, a high specificity is achieved: the model rarely misclassifies the employees who do not plan to leave the workplace. At the same time, the low precision (0.34) means that most of the employees predicted as leaving do not actually leave.
qda.fit <- qda(Attrition ~ . - JobLevel - YearsAtCompany, data = over.train.data)
qda.pred_over <- predict(qda.fit, test.data, type="response")
qda.res_over <- qda.pred_over$posterior
qda.pred.best <- as.factor(ifelse(qda.res_over[,2] > 0.4, 1, 0))
qda.conf.mat <- table(test.data$Attrition, qda.pred.best)
print(t(qda.conf.mat))
##
## qda.pred.best 0 1
## 0 197 29
## 1 72 20
metrics.qda<-compute_metrics(t(qda.conf.mat))
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.6823899" "0.2173913" "0.4081633" "0.7323420" "0.2836879"
qda.error <- mean(test.data$Attrition != qda.pred.best)
print(qda.error)
## [1] 0.3176101
KNN is an instance-based learning algorithm based on the assumption that examples with similar features belong to the same class. Given an unseen example, the model assigns the label of the majority class among its K nearest neighbors. Since it computes distances between examples, the data must be scaled so that features with a larger magnitude do not dominate.
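A toy two-employee example shows why scaling matters (hypothetical feature values):

```r
# Unscaled: MonthlyIncome (thousands) swamps Age (tens of years)
a <- c(Age = 25, MonthlyIncome = 5000)
b <- c(Age = 55, MonthlyIncome = 5100)
sqrt(sum((a - b)^2))  # ~104.4, driven almost entirely by the income gap
# After standardization both features contribute equally to the distance
m <- scale(rbind(a, b))  # column-wise (x - mean) / sd
sqrt(sum((m[1, ] - m[2, ])^2))  # 2: each feature now contributes sqrt(2)
```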
n <- nrow(HR_Analytics)
scaled.data <- as.data.frame(scale(HR_Analytics[-2]))
set.seed(108)
train <- sample(1:n, round(n * 0.75))
test <- setdiff(1:n, train)
train_data <- scaled.data[train, ]
train_data$Attrition <- HR_Analytics[train, 2]
test_data <- scaled.data[test, ]
test_data$Attrition <- HR_Analytics[test, 2]
over.train_data <- ovun.sample(Attrition ~ .,
data = train_data, p = 0.5, method = "over", seed = 108)$data
best_f1_score <- 0 # Initialize with a low value
best_k <- 0
no_improvement <- 0 # Counter for consecutive K values without improvement
for (K in 3:50) {
  cat("\nK =", K)
  predicted.classes <- knn(over.train_data, test_data, over.train_data$Attrition, K)
  conf.matrix <- table(predicted.classes, test_data$Attrition)
  metrics <- compute_metrics(conf.matrix)
  # Compare F1 scores as numbers, not strings
  current_f1_score <- as.numeric(metrics[2, which(metrics[1, ] == "F1 score")])
  if (current_f1_score > best_f1_score) {
    best_f1_score <- current_f1_score
    best_k <- K
    no_improvement <- 0 # Reset the counter
  } else {
    no_improvement <- no_improvement + 1
    if (no_improvement >= 5) { # Stop after 5 K values without improvement
      break
    }
  }
}
##
## K = 3 [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7704403" "0.3421053" "0.5306122" "0.8141264" "0.4160000"
##
## K = 4 [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7484277" "0.3218391" "0.5714286" "0.7806691" "0.4117647"
##
## K = 5 [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7295597" "0.3052632" "0.5918367" "0.7546468" "0.4027778"
##
## K = 6 [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7327044" "0.3085106" "0.5918367" "0.7583643" "0.4055944"
##
## K = 7 [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7484277" "0.3333333" "0.6326531" "0.7695167" "0.4366197"
##
## K = 8 [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7452830" "0.3260870" "0.6122449" "0.7695167" "0.4255319"
##
## K = 9 [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7515723" "0.3255814" "0.5714286" "0.7843866" "0.4148148"
##
## K = 10 [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7547170" "0.3294118" "0.5714286" "0.7881041" "0.4179104"
##
## K = 11 [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7547170" "0.3333333" "0.5918367" "0.7843866" "0.4264706"
##
## K = 12 [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7421384" "0.3103448" "0.5510204" "0.7769517" "0.3970588"
cat("\nBest K value:", best_k)
##
## Best K value: 7
# Now, use the best K value to make predictions
knn.pred <- knn(over.train_data, test_data, over.train_data$Attrition, best_k)
conf.matrix <- table(knn.pred, test_data$Attrition)
conf.matrix
##
## knn.pred 0 1
## 0 206 19
## 1 63 30
metrics.knn <- compute_metrics(conf.matrix)
## [,1] [,2] [,3] [,4] [,5]
## metric "Accuracy" "Precision" "Recall" "Specificity" "F1 score"
## values "0.7421384" "0.3225806" "0.6122449" "0.7657993" "0.4225352"
We show only a few instances of KNN, which illustrate the behaviour of the model as the hyperparameter K varies. Since our test data is highly imbalanced, as K increases it becomes easier for the model to assign any unseen data point the majority class label, which tends to raise specificity and accuracy while lowering recall.
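This majority-swamping effect is easy to reproduce on synthetic data (our own toy example, not the HR data):

```r
library(class)  # for knn(); recommended package shipped with base R
set.seed(108)
# 90/10 imbalanced one-dimensional training set
train_x <- matrix(c(rnorm(90, mean = 0), rnorm(10, mean = 3)))
train_y <- factor(c(rep(0, 90), rep(1, 10)))
test_x  <- matrix(c(2.8, 3.2))  # two points sitting near the minority class
knn(train_x, test_x, train_y, k = 3)   # small K: minority votes can still win
knn(train_x, test_x, train_y, k = 60)  # K = 60 > 2 * 10: majority always wins
```

With K = 60 the 10 minority examples can never outvote the majority, so every test point is labeled 0 no matter where it lies.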
| Model_name | Accuracy | Precision | Recall | F1_score |
|---|---|---|---|---|
| Lasso Regularization | 0.6855346 | 0.3053435 | 0.8163265 | 0.4444444 |
| Ridge Regularization | 0.6949686 | 0.3095238 | 0.7959184 | 0.4457143 |
| LDA | 0.6949686 | 0.3095238 | 0.7959184 | 0.4457143 |
| Naive Bayes | 0.5345912 | 0.2171429 | 0.7755102 | 0.3392857 |
| Logistic Regression | 0.7138365 | 0.3189655 | 0.7551020 | 0.4484848 |
| Logistic regression with Backward Selection | 0.7044025 | 0.3109244 | 0.7551020 | 0.4404762 |
| KNN | 0.7421384 | 0.3225806 | 0.6122449 | 0.4225352 |
| Logistic Regression Unbalanced | 0.8459119 | 0.5000000 | 0.4285714 | 0.4615385 |
| QDA | 0.6823899 | 0.2173913 | 0.4081633 | 0.2836879 |
The plot above shows that the reported models behave similarly: the AUC (Area Under the Curve) does not differ greatly among them.
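For completeness, AUC can be computed directly from scores and labels via the rank (Wilcoxon) identity: it is the probability that a randomly chosen positive is scored above a randomly chosen negative. A self-contained sketch with toy scores (not our models' outputs):

```r
# AUC as P(score_pos > score_neg), with ties counted as half
scores <- c(0.9, 0.8, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0)
pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # 5 of the 6 positive-negative pairs are correctly ordered: 0.8333
```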
We then analyzed the misclassified instances for the logistic regression model after backward selection. In general, with the right threshold, our models achieved a good recall, with few false negatives but a higher number of false positives. The following plots show that the instances misclassified most often belong to the categories with a higher probability of attrition identified in the EDA: we observed a tendency to leave among employees who are single, work overtime, or travel frequently for work. Employees in these categories who do not quit are harder to classify correctly, as they share many patterns with their colleagues who do leave. Below, we observe this behaviour for the features MaritalStatus, OverTime, and BusinessTravel.
test.data <- test.data %>%
mutate(Misclassified = ifelse(predicted.classes.over != test.data$Attrition, 1, 0),
MisclassifiedFP = ifelse(predicted.classes.over == 1 & test.data$Attrition == 0, 1, 0),
MisclassifiedFN = ifelse(predicted.classes.over== 0 & test.data$Attrition == 1, 1, 0))
marital_stats <- test.data %>%
group_by(MaritalStatus) %>%
summarise(CorrectPercentage = sum(1 - Misclassified) / n() * 100,
MisclassifiedFPPercentage = sum(MisclassifiedFP) / n() * 100,
MisclassifiedFNPercentage = sum(MisclassifiedFN) / n() * 100) %>%
pivot_longer(cols = c("CorrectPercentage", "MisclassifiedFPPercentage", "MisclassifiedFNPercentage"),
names_to = "PercentageType", values_to = "Percentage")
marital_stats$MaritalStatus <- factor(marital_stats$MaritalStatus, levels = c(1, 0, 2),
labels = c("Single", "Married", "Divorced"))
ggplot(data = marital_stats, aes(x = MaritalStatus, y = Percentage, fill = PercentageType)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Percentage of Correctly and Misclassified Instances by MaritalStatus",
x = "MaritalStatus", y = "Percentage") +
scale_fill_manual(values = c("CorrectPercentage" = "#7FB942", "MisclassifiedFPPercentage" = "#B9427F", "MisclassifiedFNPercentage" = "#427FB9")) +
theme_minimal() +
theme(legend.position = "top")
business_travel_stats <- test.data %>%
group_by(BusinessTravel) %>%
summarise(CorrectPercentage = sum(1 - Misclassified) / n() * 100,
MisclassifiedFPPercentage = sum(MisclassifiedFP) / n() * 100,
MisclassifiedFNPercentage = sum(MisclassifiedFN) / n() * 100) %>%
pivot_longer(cols = c("CorrectPercentage", "MisclassifiedFPPercentage", "MisclassifiedFNPercentage"),
names_to = "PercentageType", values_to = "Percentage")
business_travel_stats$BusinessTravel <- factor(business_travel_stats$BusinessTravel,
levels = c(0, 1, 2),
labels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently"))
ggplot(data = business_travel_stats, aes(x = BusinessTravel, y = Percentage, fill = PercentageType)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Percentage of Correctly and Misclassified Instances by BusinessTravel",
x = "BusinessTravel", y = "Percentage") +
scale_fill_manual(values = c("CorrectPercentage" = "#7FB942", "MisclassifiedFPPercentage" = "#B9427F", "MisclassifiedFNPercentage" = "#427FB9")) +
theme_minimal() +
theme(legend.position = "top", plot.title = element_text(size = 12))
Employees who both travel frequently for work and work overtime are more likely to be misclassified.
In our exploratory data analysis, we observed that employees in managerial and research director roles tend to have longer tenures with the company and receive higher salaries. Consequently, they are less likely to leave their positions and are also less prone to misclassification. In contrast, sales representatives and human resources staff exhibit a different pattern, with a higher likelihood of both attrition and misclassification.
We then visualized the distribution of monthly income, stratified into correctly classified and misclassified instances.
We found that our models have quite similar metrics and a tendency to overfit. Ultimately, we opted for the model with the highest recall - Lasso Regularization - as our goal is to identify employees who may be considering leaving the company. Besides having the highest recall, Lasso maintains a good balance with the other metrics, boasting the largest area under the precision-recall curve.
The actions the company takes to prevent attrition should not harm the employees our model misclassifies, so these investments must be made wisely. From the chosen model, it is evident that the most significant feature is OverTime. We recommend the company take this into consideration by potentially adjusting the overtime rates and establishing limits on weekly overtime hours. It is also crucial to gain a deeper understanding of the underlying factors contributing to this phenomenon, for example by conducting surveys on the workflow within the company. Such a survey can help identify significant factors, especially related to teamwork, that may be slowing down productivity, and determine whether employees voluntarily opt to work overtime out of financial need or as a result of a stressful or unhealthy work environment.
The second most influential variable is BusinessTravel: undoubtedly, the most intuitive action a company can take is to guarantee better compensation for travel expenses. Finally, we suggest that any company submit periodic surveys in order to maintain a constantly updated picture of the well-being and intentions of its employees, which keeps the model's predictions accurate. These predictions must be continuously taken into consideration so that timely measures can be adopted to limit any loss. We further propose integrating more specific questions into the survey:
ManagerSatisfaction: evaluation from 1 to 4 of the manager;
TeamWork: evaluation from 1 to 4 of the quality of the environment and workflow in the team.